Hedge detection as a lens on framing in the GMO debates: A position paper 



Eunsol Choi*, Chenhao Tan*, Lillian Lee*, Cristian Danescu-Niculescu-MiziF and Jennifer Spindel^ 

*Department of Computer Science, ^Department of Plant Breeding and Genetics 

Cornell University 

ec472@cornell.edu, chenhao |llee|cristian@cs. cornell.edu, jes462@cornell.edu 



(N 



O 

(N 

C 

1—5 

in 

u 

O 



> 
O 

O 

(N 



X 



Abstract 



Understanding the ways in which participants 
in public discussions frame their arguments is 
important in understanding how public opin- 
ion is formed. In this paper, we adopt the po- 
sition that it is time for more computationally- 
oriented research on problems involving fram- 
ing. In the interests of furthering that goal, 
we propose the following specific, interesting 
and, we believe, relatively accessible ques- 
tion: In the controversy regarding the use 
of genetically-modified organisms (GMOs) in 
agriculture, do pro- and anti-GMO articles dif- 
fer in whether they choose to adopt a more 
"scientific" tone? 

Prior work on the rhetoric and sociology of 
science suggests that hedging may distin- 
guish popular-science text from text written 
by professional scientists for their colleagues. 
We propose a detailed approach to studying 
whether hedge detection can be used to un- 
derstanding scientific framing in the GMO de- 
bates, and provide corpora to facilitate this 
study. Some of our preliminary analyses sug- 
gest that hedges occur less frequently in scien- 
tific discourse than in popular text, a finding 
that contradicts prior assertions in the litera- 
ture. We hope that our initial work and data 
will encourage others to pursue this promising 
line of inquiry. 

Publication venue: ACL Workshop on Extra- 
Propositional Aspects of Meaning in Compu- 
tational Linguistics, 2012 



1 Introduction 

1.1 Framing, "scientific discourse", and GMOs 
in the media 

The issue of framing (Goffman, 1974; Scheufele, 
1999; Benford and Snow, 2000) is of great im- 
portance in understanding how public opinion is 
formed. In their Annual Review of Political Science 
survey, Chong and Druckman (2007) describe fram- 
ing effects as occurring "when (often small) changes 
in the presentation of an issue or an event produce 
(sometimes large) changes of opinion" (p. 104); 
as an example, they cite a study wherein respon- 
dents answered differently, when asked whether a 
hate group should be allowed to hold a rally, depend- 
ing on whether the question was phrased as one of 
"free speech" or one of "risk of violence". 

The genesis of our work is in a framing question 
motivated by a relatively current political issue. In 
media coverage of transgenic crops and the use of 
genetically modified organisms (GMOs) in food, do 
pro-GMO vs. anti-GMO articles differ not just with 
respect to word choice, but in adopting a more "sci- 
entific" discourse, meaning the inclusion of more 
uncertainty and fewer emotionally-laden words? We 
view this as an interesting question from a text anal- 
ysis perspective (with potential applications and im- 
plications that lie outside the scope of this article). 

1.2 Hedging as a sign of scientific discourse 

To obtain a computationally manageable character- 
ization of "scientific discourse", we turned to stud- 
ies of the culture and language of science, a body 
of work spanning fields ranging from sociology to 



applied linguistics to rhetoric and communication 
(Gilbert and Mulkay, 1984; Latour, 1987; Latour 
and Woolgar, 1979; Halliday and Martin, 1993; Baz- 
erman, 1988; Fahnestock, 2004; Gross, 1990). 

One characteristic that has drawn quite a bit of 
attention in such studies is hedging (Myers, 1989; 
Hyland, 1998; Lewin, 1998; Salager-Meyer, 201 1). 1 
Hyland (1998, pg. 1) defines hedging as the ex- 
pression of "tentativeness and possibility" in com- 
munication, or, to put it another way, language cor- 
responding to "the writer withholding full commit- 
ment to statements" (pg. 3). He supplies many 
real-life examples from scientific research articles, 
including the following: 

1 . 'It seems that this group plays a critical role in 
orienting the carboxyl function' (emphasis Hy- 
land's) 

2. '...implies that phytochrome A is also not nec- 
essary for normal photomorphogenesis, at least 
under these irradiation conditions' (emphasis 
Hyland's) 

3. 'We wish to suggest a structure for the salt of 
deoxyribose nucleic acid (D.N.A.)' (emphasis 
added) 2 

Several scholars have asserted the centrality of hedg- 
ing in scientific and academic discourse, which cor- 
responds nicely to the notion of "more uncertainty" 
mentioned above. Hyland (1998, p. 6) writes, "De- 
spite a widely held belief that professional scientific 
writing is a series of impersonal statements of fact 
which add up to the truth, hedges are abundant in 
science and play a critical role in academic writing". 
Indeed, Myers (1989, p. 13) claims that in scien- 
tific research articles, "The hedging of claims is so 
common that a sentence that looks like a claim but 
has no hedging is probably not a statement of new 
knowledge". 

Not only is understanding hedges important to un- 
derstanding the rhetoric and sociology of science, 

'in linguistics, hedging has been studied since the 1970s 
(Lakoff, 1973). 

2 This example originates from Watson and Crick's land- 
mark 1953 paper. Although the sentence is overtly tentative, 
did Watson and Crick truly intend to be polite and modest in 
their claims? See Varttala (2001) for a review of arguments re- 
garding this question. 

3 Note the inclusion of the hedge "probably". 



but hedge detection and analysis — in the sense of 
identifying uncertain or uncertainly-sourced infor- 
mation (Farkas et al., 2010) — has important appli- 
cations to information extraction, broadly construed, 
and has thus become an active sub-area of natural- 
language processing. For example, the CoNLL 2010 
Shared Task was devoted to this problem (Farkas 
et al., 2010). 

Putting these two lines of research together, we 
see before us what appears to be an interesting in- 
terdisciplinary and, at least in principle, straightfor- 
ward research program: relying on the aforemen- 
tioned rhetoric analyses to presume that hedging is 
a key characteristic of scientific discourse, build a 
hedge-detection system to computationally ascertain 
which proponents in the GMO debate tend to use 
more hedges and thus, by presumption, tend to adopt 
a more "scientific" frame. 4 

1.3 Contributions 

Our overarching goal in this paper is to convince 
more researchers in NLP and computational linguis- 
tics to work on problems involving framing. We 
try to do so by proposing a specific problem that 
may be relatively accessible. Despite the apparent 
difficulty in addressing such questions, we believe 
that progress can be made by drawing on observa- 
tions drawn from previous literature across many 
fields, and integrating such work with movements 
in the computational community toward considera- 
tion of extra-propositional and pragmatic concerns. 
We have thus intentionally tried to "cover a lot of 
ground", as one referee put it, in the introductory 
material just discussed. 

Since framing problems are indeed difficult, we 
elected to narrow our scope in the hope of making 
some partial progress. Our technical goal here, at 
this workshop, where hedge detection is one of the 
most relevant topics to the broad questions we have 
raised, is not to learn to classify texts as being pro- 
vs. anti-GMO, or as being scientific or not, per se. 5 

4 However, this presumption that more hedges characterize a 
more scientific discourse has been contested. See section 2 for 
discussion and section 4.2 for our empirical investigation. 

5 Several other groups have addressed the problem of try- 
ing to identify different sides or perspectives (Lin et al., 2006; 
Hardisty et al., 2010; Beigman Klebanov et al., 2010; Ahmed 
and Xing, 2010). 



Our focus is on whether hedging specifically, con- 
sidered as a single feature, is correlated with these 
different document classes, because of the previous 
research attention that has been devoted to hedging 
in particular and because of hedging being one of the 
topics of this workshop. The point of this paper is 
thus not to compare the efficacy of hedging features 
with other types, such as bag-of-words features. Of 
course, to do so is an important and interesting di- 
rection for future work. 

In the end, we were not able to achieve satisfac- 
tory results even with respect to our narrowed goal. 
However, we believe that other researchers may be 
able to follow the plan of attack we outline below, 
and perhaps use the data we are releasing, in order 
to achieve our goal. We would welcome hearing the 
results of other people's efforts. 

2 How should we test whether hedging 
distinguishes scientific text? 

One very important point that we have not yet ad- 
dressed is: While the literature agrees on the impor- 
tance of hedging in scientific text, the relative de- 
gree of hedging in scientific vs. non-scientific text is 
a matter of debate. 

On the one side, we have assertions like those of 
Fahnestock (1986), who shows in a clever, albeit 
small-scale, study involving parallel texts that when 
scientific observations pass into popular accounts, 
changes include "removing hedges ... thus con- 
ferring greater certainty on the reported facts" (pg. 
275). Similarly, Juanillo, Jr. (2001) refers to a shift 
from a forensic style to a "celebratory" style when 
scientific research becomes publicized, and credits 
Brown (1998) with noting that "celebratory scien- 
tific discourses tend to pay less attention to caveats, 
contradictory evidence, and qualifications that are 
highlighted in forensic or empiricist discourses. By 
downplaying scientific uncertainty, it [sic] alludes to 
greater certainty of scientific results for public con- 
sumption" (Juanillo, Jr., 2001, p. 42). 

However, others have contested claims that the 
popularization process involves simplification, dis- 
tortion, hype, and dumbing down, as Myers (2003) 
colorfully puts it; he provides a critique of the rel- 
evant literature. Varttala (1999) ran a corpus anal- 
ysis in which hedging was found not just in pro- 



fessional medical articles, but was also "typical of 
popular scientific articles dealing with similar top- 
ics" (p. 195). Moreover, significant variation in use 
of hedging has been found across disciplines and au- 
thors' native language; see Salager-Meyer (2011) or 
Varttala (2001) for a review. 

To the best of our knowledge, there have been no 
large-scale empirical studies validating the hypoth- 
esis that hedges appear more or less frequently in 
scientific discourse. 

Proposed procedure Given the above, our first 
step must be to determine whether hedges are more 
or less prominent in "professional scientific" (hence- 
forth "prof-science") vs. "public science" (hence- 
forth "pop-science") discussions of GMOs. Of 
course, for a large-scale study, finding hedges re- 
quires developing and training an effective hedge de- 
tection algorithm. 

If the first step shows that hedges can indeed be 
used to effectively distinguish prof-science vs. pop- 
science discourse on GMOs, then the second step is 
to examine whether the use of hedging in pro-GMO 
articles follows our inferred "scientific" occurrence 
patterns to a greater extent than the hedging in anti- 
GMO articles. 

However, as our hedge classifier trained on the 
CoNLL dataset did not perform reliably on the dif- 
ferent domain of prof-science vs. pop-science dis- 
cussions of GMOs, we focus the main content of this 
paper on the first step. We describe data collection 
for the second step in the appendix. 

3 Data 

To accomplish the first step of our proposed pro- 
cedure outlined above, we first constructed a prof- 
science/pop-science corpus by pulling text from 
Web of Science for prof-science examples and from 
LexisNexis for pop-science examples, as described 
in Section 3.1. Our corpus will be posted online 
at https://confluence.cornell.edu/display/llresearch/ 
HedgingFramingGMOs. 

As noted above, computing the degree of hedg- 
ing in the aforementioned corpus requires access to 
a hedge-detection algorithm. We took a supervised 
approach, taking advantage of the availability of the 
CoNLL 2010 hedge-detection training and evalua- 
tion corpora, described in Section 3.2 



Dataset 


Doc type 


# docs 


# sentences 




sentence length 


Flesch reading ease 


Prof-science/pop-science corpus 


WOS 


abstracts 


648 


5596 




22.35 


23.39 


LEXIS 


(short) articles 


928 


36795 




24.92 


45.78 


Hedge-detection corpora 


Bio (train) 


abstracts, articles 


1273, 9 


14541 (18% uncertain) 




29.97 


20.77 


Bio (eval) 


articles 


15 


5003 (16% uncertain) 




31.30 


30.49 


Wiki (train) 


paragraphs 


2186 


11111 (22% uncertain) 




23.07 


35.23 


Wiki (eval) 


paragraphs 


2346 


9634 (23% uncertain) 




20.82 


31.71 



Table 1: Basic descriptive statistics for the main corpora we worked with. We created the first two. Higher Flesch 
scores indicate text that is easier to read. 



3.1 Prof-science/pop-science data: LEXIS and 
WOS 

As mentioned previously, a corpus of prof-science 
and pop-science articles is required to ascertain 
whether hedges are more prevalent in one or the 
other of these two writing styles. Since our ultimate 
goal is to look at discourse related to GMOs, we re- 
strict our attention to documents on this topic. 

Thomson Reuter's Web of Science (WOS), a 
database of scientific journal and conference arti- 
cles, was used as a source of prof-science samples. 
We chose to collect abstracts, rather than full scien- 
tific articles, because intuition suggests that the lan- 
guage in abstracts is more high-level than that in the 
bodies of papers, and thus more similar to the lan- 
guage one would see in a public debate on GMOs. 
To select for on-topic abstracts, we used the phrase 
"transgenic foods" as a search keyword and dis- 
carded results containing any of a hand-selected list 
of off-topic filtering terms (e.g., "mice" or "rats"). 
We then made use of domain expertise to manually 
remove off-topic texts. The process yielded 648 doc- 
uments for a total of 5596 sentences. 

Our source of pop-science articles was Lexis- 
Nexis (LEXIS). On-topic documents were collected 
from US newspapers using the search keywords "ge- 
netically modified foods" or "transgenic crops" and 
then imposing the additional requirement that at 
least two terms on a hand-selected list 7 be present 
in each document. After the removal of duplicates 

7 The term list: GMO, GM, GE, genetically modified, ge- 
netic modification, modified, modification, genetic engineer- 
ing, engineered, bioengineered, franken, transgenic, spliced, 
G.M.O., tweaked, manipulated, engineering, pharming, aqua- 
culture. 



and texts containing more than 2000 words to delete 
excessively long articles, our final pop-science sub- 
corpus was composed of 928 documents. 

3.2 CoNLL hedge-detection training data 8 

As described in Farkas et al. (2010), the motivation 
behind the CoNLL 2010 shared task is that "distin- 
guishing factual and uncertain information in texts is 
of essential importance in information extraction". 
As "uncertainty detection is extremely important for 
biomedical information extraction", one component 
of the dataset is biological abstracts and full arti- 
cles from the BioScope corpus (Bio). Meanwhile, 
the chief editors of Wikipedia have drawn the at- 
tention of the public to specific markers of uncer- 
tainty known as weasel words 9 : they are words or 
phrases "aimed at creating an impression that some- 
thing specific and meaningful has been said", when, 
in fact, "only a vague or ambiguous claim, or even 
a refutation, has been communicated". An example 
is "It has been claimed that the claimant has not 
been identified, so the source of the claim cannot be 
verified. Thus, another part of the dataset is a set 
of Wikipedia articles (Wiki) annotated with weasel- 
word information. We view the combined Bio+Wiki 
corpus (henceforth the CoNLL dataset) as valuable 
for developing hedge detectors, and we attempt to 
study whether classifiers trained on this data can be 
generalized to other datasets. 

3.3 Comparison 

Table 1 gives the basic statistics on the main datasets 
we worked with. Though WOS and LEXIS differ in 

8 http://www.inf. u-szeged.hu/rgai/conll2010st/ 
9 http://en. wikipedia.org/wiki/Weasel_word 



the total number of sentences, the average sentence 
length is similar. The average sentence length in Bio 
is longer than that in Wiki. The articles in WOS 
are markedly more difficult to read than the articles 
in LEXIS according to Flesch reading ease (Kincaid 
et al., 1975). 

4 Hedging to distinguish scientific text: 
Initial annotation 

As noted in Section 1, it is not a priori clear whether 
hedging distinguishes scientific text or that more 
hedges correspond to a more "scientific" discourse. 
To get an initial feeling for how frequently hedges 
occur in WOS and LEXIS, we hand-annotated a 
sample of sentences from each. In Section 4.1, we 
explain the annotation policy of the CoNLL 2010 
Shared Task and our own annotation method for 
WOS and LEXIS. After that, we move forward in 
Section 4.2 to compare the percentage of uncertain 
sentences in prof-science vs. pop-science text on 
this small hand-labeled sample, and gain some ev- 
idence that there is indeed a difference in hedge oc- 
currence rates, although, perhaps surprisingly, there 
seem to be more hedges in the pop-science texts. 

As a side benefit, we subsequently use the 
hand-labeled sample we produce to investigate the 
accuracy of an automatic hedge detector in the 
WOS+LEXIS domain; more on this in Section 5. 

4.1 Uncertainty annotation 

CoNLL 2010 Shared Task annotation policy As 

described in Farkas et al. (2010, pg. 4), the data an- 
notation polices for the CoNLL 2010 Shared Task 
were that "sentences containing at least one cue 
were considered as uncertain, while sentences with 
no cues were considered as factual", where a cue 
is a linguistic marker that in context indicates un- 
certainty. A straightforward example of a sentence 
marked "uncertain" in the Shared Task is 'Mild blad- 
der wall thickening raises the question of cystitis.' 
The annotated cues are not necessarily general, par- 
ticularly in Wiki, where some of the marked cues 
are as specific as 'some of Schumann's best choral 
writing' , 'people of the jewish tradition' , or 'certain 
leisure or cultural activities' . 

Note that "uncertainty" in the Shared Task def- 
inition also encompassed phrasing that "creates an 



impression that something important has been said, 
but what is really communicated is vague, mislead- 
ing, evasive or ambiguous ... [offering] an opinion 
without any backup or source". An example of such 
a sentence, drawn from Wikipedia and marked "un- 
certain" in the Shared Task, is 'Some people claim 
that this results in a better taste than that of other diet 
colas (most of which are sweetened with aspartame 
alone).'; Farkas et al. (2010) write, "The ... sentence 
does not specify the source of the information, it is 
just the vague term 'some people' that refers to the 
holder of this opinion". 

Our annotation policy We hand-annotated 200 
randomly-sampled sentences, half from WOS and 
half from LEXIS 10 , to gauge the frequency with 
which hedges occur in each corpus. Two annota- 
tors each followed the rules of the CoNLL 2010 
Shared Task to label sentences as certain, uncertain, 
or not a proper sentence. 1 1 The annotators agreed on 
153 proper sentences of the 200 sentences (75 from 
WOS and 78 from LEXIS). Cohen's Kappa (Fleiss, 
1981) was 0.67 on the annotation, which means that 
the consistency between the two annotators was fair 
or good. However, there were some interesting cases 
where the two annotators could not agree. For ex- 
ample, in the sentence 'Cassava is the staple food of 
tropical Africa and its production, averaged over 24 
countries, has increased more than threefold from 
1980 to 2005 ... ', one of the annotators believed 
that "more than" made the sentence uncertain. These 
borderline cases indicate that the definition of hedg- 
ing should be carefully delineated in future studies. 

4.2 Percentages of uncertain sentences 

To validate the hypothesis that prof-science articles 
contain more hedges, we computed the percentage 
of uncertain sentences in our labeled data. As shown 
in Table 2, we observed a trend contradicting ear- 
lier studies. Uncertain sentences were more frequent 
in LEXIS than in WOS, though the difference was 

10 We took steps to attempt to hide from the annotators any 
explicit clues as to the source of individual sentences: the sub- 
set of authors who did the annotation were not those that col- 
lected the data, and the annotators were presented the sentences 
in random order. 

1 'The last label was added because of a few errors in scraping 
the data. 



Dataset 


% of uncertain sentences 


wos 

LEXIS 

Bio 

Wiki 


(estimated from 7 5 -sentence sample) 20 
(estimated from 78-sentence sample) 28 

17 
23 



Table 2: Percentages of uncertain sentences. 

not statistically significant 12 (perhaps not surprising 
given the small sample size). The same trend was 
seen in the CoNLL dataset: there, too, the percent- 
age of uncertain sentences was significantly smaller 
in Bio (prof-science articles) than in Wiki. In order 
to make a stronger argument about prof-science vs 
pop-science, however, more annotation on the WOS 
and LEXIS datasets is needed. 

5 Experiments 

As stated in Section 1, our proposal requires devel- 
oping an effective hedge detection algorithm. Our 
approach for the preliminary work described in this 
paper is to re-implement Georgescul's (2010) algo- 
rithm; the experimental results on the Bio+Wiki do- 
main, given in Section 5.1, are encouraging. Then 
we use this method to attempt to validate (at a larger 
scale than in our manual pilot annotation) whether 
hedges can be used to distinguish between prof- 
science and pop-science discourse on GMOs. Un- 
fortunately, our results, given in Section 5.2, are 
inconclusive, since our trained model could not 
achieve satisfactory automatic hedge-detection ac- 
curacy on the WOS+LEXIS domain. 

5.1 Method 

We adopted the method of Georgescul (2010): Sup- 
port Vector Machine classification based on a Gaus- 
sian Radial Basis kernel function (Vapnik, 1998; Fan 
et al., 2005), employing n-grams from annotated cue 
phrases as features, as described in more detail be- 
low. This method achieved the top performance in 
the CoNLL 2010 Wikipedia hedge-detection task 
(Farkas et al., 2010), and SVMs have been proven 
effective for many different applications. We used 
the LIBSVM toolkit in our experiments 13 . 



As described in Section 3.2, there are two separate 
datasets in the CoNLL dataset. We experimented on 
them separately (Bio, Wiki). Also, to make our clas- 
sifier more generalizable to different datasets, we 
also trained models based on the two datasets com- 
bined (Bio+Wiki). As for features, we took advan- 
tage of the observation in Georgescul (2010) that the 
bag-of-words model does not work well for this task. 
We used different sets of features based on hedge 
cue words that have been annotated as part of the 
CoNLL dataset distribution 14 . The basic feature set 
was the frequency of each hedge cue word from the 
training corpus after removing stop words and punc- 
tuation and transforming words to lowercase. Then, 
we extracted unigrams, bigrams and trigrams from 
each hedge cue phrase. Table 3 shows the number 
of features in different settings. Notice that there are 
many more features in Wiki. As mentioned above, 
in Wiki, some cues are as specific as 'some of Schu- 
mann's best choral writing', 'people of the jewish 
tradition', or ' certain leisure or cultural activities'. 
Taking n-grams from such specific cues can cause 
some sentences to be classified incorrectly. 



Feature source 


#features 


Bio 

Bio (cues + bigram + trigram) 
Wiki 

Wiki (cues + bigram + trigram) 


220 
340 
3740 
10603 



Table 3: Number of features. 



Best cross-validation performance 


Dataset 


(C,7) 


P 


R 


F 


Bio 


(40, 2~ 3 ) 


84.0 


92.0 


87.8 


Wiki 


(30, 2~ 6 ) 


64.0 


76.3 


69.6 


Bio+Wiki 


(10, 2" 4 ) 


66.7 


78.3 


72.0 



Table 4: Best 5-fold cross-validation performance for Bio 
and/or Wiki after parameter tuning. As a reminder, we 
repeat that our intended final test set is the WOS+LEXIS 
corpus, which is disjoint from Bio+Wiki. 

We adopted several techniques from Georgescul 



12 Throughout, "statistical significance" refers to the student 
t-test withp < .05. 



13 http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 



For the Bio model, we used cues extracted from Bio. Like- 
wise, the Wiki model used cues from Wiki, and the Bio+Wiki 
model used cues from Bio+Wiki. 



Evaluation set 


Model 


P 


R 


F 


WOS+LEXIS 


Bio 


54 


68 


60 


WOS+LEXIS 


Wiki 


38 


54 


45 


WOS+LEXIS 


Bio+Wiki 


21 


93 


34 


Sub-corpus performance of the model based on Bio 


wos 


Bio 


58 


73 


65 


LEXIS 


Bio 


52 


64 


57 



Table 5: The upper part shows the performance on WOS 
and LEXIS based on models trained on the CoNLL 
dataset. The lower part gives the sub-corpus results for 
Bio, which provided the best performance on the full 
WOS+LEXIS corpus. 



(2010) to optimize performance through cross vali- 
dation. Specifically, we tried different combinations 
of feature sets (the cue phrases themselves, cues + 
unigram, cues + bigram, cues + trigram, cues + uni- 
gram + bigram + trigram, cues + bigram + trigram). 
We tuned the width of the RBF kernel (7) and the 
regularization parameter (C) via grid search over the 
following range of values: {2~ 9 , 2~ 8 , 2~ 7 , . . . , 2 4 } 
for 7, {1, 10, 20, 30, ... , 150} for C. We also tried 
different weighting strategies for negative and pos- 
itive classes (i.e., either proportional to the number 
of positive instances, or uniform). We performed 5- 
fold cross validation for each possible combination 
of experimental settings on the three datasets (Bio, 
Wiki, Bio+Wiki). 

Table 4 shows the best performance on all three 
datasets and the corresponding parameters. In the 
three datasets, cue+bigram+trigram provided the 
best performance, and the weighted model con- 
sistently produced superior results to the uniform 
model. The Fl measure for Bio was 87.8, which 
was satisfactory, while the Fl results for Wiki were 
69.6, which were the worst of all the datasets. 
This resonates with our observation that the task on 
Wikipedia is more subtly defined and thus requires 
a more sophisticated approach than counting the oc- 
currences of bigrams and digrams. 

5.2 Results on WOS+LEXIS 

Next, we evaluated whether our best classifier 
trained on the CoNLL dataset can be generalized to 
other datasets, in particular, the WOS and LEXIS 
corpus. Performance was measured on the 153 sen- 



Evaluation set (C, 7) P R F 

WOS + LEXIS (50, 2~ 9 ) ~68 62 65 

WOS (50, 2~ 9 ) 85 73 79 

LEXIS (50, 2~ 9 ) 57 54 56 



Table 6: Best performance after parameter tuning 
based on the 153 labeled WOS+LEXIS sentences; this 
gives some idea of the upper-bound potential of our 
Georgescul-based method. The training set is Bio, which 
gave the best performance in Table 5. 

tences on which our annotators agreed, a dataset 
that was introduced in Section 4.1. Table 5 shows 
how the best models trained on Bio, Wiki, and 
Bio+Wiki, respectively, performed on the 153 la- 
beled sentences. First, we can see that the perfor- 
mance degraded significantly compared to the per- 
formance for in-domain cross validation. Second, of 
the three different models, Bio showed the best per- 
formance. Bio+Wiki gave the worst performance, 
which hints that combining two datasets and cue 
words may not be a promising strategy: although 
Bio+Wiki shows very good recall, this can be at- 
tributed to its larger feature set, which contains all 
available cues and perhaps as a result has a very high 
false-positive rate. We further investigated and com- 
pared performance on LEXIS and WOS for the best 
model (Bio). Not surprisingly, our classifier works 
better in WOS than in LEXIS. 

It is clear that there exist domain differences be- 
tween the CoNLL dataset and WOS+LEXIS. To bet- 
ter understand the poor cross-domain performance 
of the classifier, we tuned another model based on 
the performance on the 153 labeled sentences us- 
ing Bio as training data. As we can see in Table 
6, the performance on WOS improved significantly, 
while the performance on LEXIS decreased. This 
is probably caused by the fact that WOS is a col- 
lection of scientific paper abstracts, which is more 
similar to the training corpus than LEXIS, which is 
a collection of news media articles 15 . Also, LEXIS 
articles are hard to classify even with the tuned 
model, which challenges the effectiveness of a cue- 
words frequency approach beyond professional sci- 
entific texts. Indeed, the simplicity of our reim- 

15 The Wiki model performed better on LEXIS than on WOS. 
Though the performance was not good, this result further rein- 
forces the possibility of a domain-dependence problem. 



plementation of Georgescul's algorithm seems to 
cause longer sentences to be classified as uncer- 
tain, because cue phrases (or n-grams extracted from 
cue phrases) are more likely to appear in lengthier 
sentences. Analysis of the best performing model 
shows that the false-positive sentences are signifi- 
cantly longer than the false-negative ones. 16 



Dataset 


Model 


% classified uncertain 


wos 


Bio 


16 


LEXIS 


Bio 


19 


WOS 


Tuned 


15 


LEXIS 


Tuned 


14 



Table 7: For completeness, we report here the percentage 
of uncertain sentences in WOS and LEXIS according to 
our trained classifiers, although we regard these results as 
unreliable since those classifiers have low accuracy. Bio 
refers to the best model trained on Bio only in Section 5.1, 
while Tuned refers to the model in Table 6 that is tuned 
based on the 153 labeled sentences in WOS+LEXIS. 

While the cross-domain results were not reliable, 
we produced preliminary results on whether there 
exist fewer hedges in scientific text. We can see that 
the relative difference in certain/uncertain ratios pre- 
dicted by the two different models (Bio, Tuned) are 
different in Table 7. In the tuned model, the differ- 
ence between LEXIS and WOS in terms of the per- 
centage of uncertain sentences was not statistically 
significant, while in the Bio model, their difference 
was statistically significant. Since the performance 
of our hedge classifier on the 153 hand-annotated 
WOS+LEXIS sentences was not reliable, though, 
we must abstain from making conclusive statements 
here. 

6 Conclusion and future work 

In this position paper, we advocated that researchers 
apply hedge detection not only to the classic moti- 
vation of information-extraction problems, but also 
to questions of how public opinion forms. We pro- 
posed a particular problem in how participants in de- 
bates frame their arguments. Specifically, we asked 
whether pro-GMO and anti-GMO articles differ in 
adopting a more "scientific" discourse. Inspired by 

16 Average length of true positive sentences : 28.6, false pos- 
itive sentences 35.09, false negative sentences: 22.0. 



earlier studies in social sciences relating hedging to 
texts aimed at professional scientists, we proposed 
addressing the question with automatic hedge de- 
tection as a first step. To develop a hedge clas- 
sifier, we took advantage of the CoNLL dataset 
and a small annotated WOS and LEXIS dataset. 
Our preliminary results show there may exist a gap 
which indicates that hedging may, in fact, distin- 
guish prof-science and pop-science documents. In 
fact, this computational analysis suggests the possi- 
bility that hedges occur less frequently in scientific 
prose, which contradicts several prior assertions in 
the literature. 

To confirm the argument that pop-science tends 
to use more hedging than prof-science, we need 
a hedge classifier that performs more reliably in 
the WOS and LEXIS dataset than ours does. An 
interesting research direction would be to develop 
transfer-learning techniques to generalize hedge 
classifiers for different datasets, or to develop a gen- 
eral hedge classifier relatively robust to domain dif- 
ferences. In either case, more annotated data on 
WOS and LEXIS is needed for better evaluation or 
training. 

Another strategy would be to bypass the first step, 
in which we determine whether hedges are more 
or less prominent in scientific discourse, and pro- 
ceed directly to labeling and hedge-detection in pro- 
GMO and anti-GMO texts. However, this will not 
answer the question of whether advocates in debates 
other than on GMO-related topics employ a more 
scientific discourse. Nonetheless, to aid those who 
wish to pursue this alternate strategy, we have col- 
lected two sets of opinionated articles on GMO (pro- 
and anti-); see appendix for more details. 
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After the initial collection of data, near-duplicates 
and irrelevant articles were filtered through cluster- 
ing, keyword searches and distance between word 
vectors at the document level. We have collected 
762 "anti" documents and 671 "pro" documents. 
We reduced this to a 404 "pro" and 404 "con" 
set as follows. Each retained "document" con- 
sists of only the first 200 words after excluding the 
first 50 words of documents containing over 280 
words. This was done to avoid irrelevant sections 
such as Educators have permission to reprint arti- 
cles for classroom use; other users, please contact 
editor@actionbioscience.org for reprint permission. 
See reprint policy. 

The data will be posted online at 
https://confluence.cornell.edu/display/llresearch/ 
HedgingFramingGMOs. 



7 Appendix: pro- vs. anti-GMO dataset 

Here, we describe the pro- vs. anti-GMO dataset we 
collected, in the hopes that this dataset may prove 
helpful in future research regarding the GMO de- 
bates, even though we did not use the corpus in the 
project described in this paper. 

The second step of our overall procedure out- 
lined in the introduction — that step being to ex- 
amine whether the use of hedging in pro-GMO arti- 
cles conesponds with our inferred "scientific" oc- 
currence patterns more than that in anti-GMO ar- 
ticles — requires a collection of opinionated arti- 
cles on GMOs. Our first attempt to use news me- 
dia articles (LEXIS) was unsatisfying, as we found 
many articles attempt to maintain a neutral position. 
This led us to collect documents from more strongly 
opinionated organizational websites such as Green- 
peace (anti-GMO), Non GMO Project (anti-GMO), 
or Why Biotechnology (pro-GMO). Articles were 
collected from 20 pro-GMO and 20 anti-GMO or- 
ganizational web sites. 



